Colorectal cancer (CRC) has the 3rd highest mortality rate among cancers
More effective methods to detect CRC are needed
Correlation between exosomes and tumorigenesis
miRNA and mRNA can serve as biomarkers, and these are what we want to find!
The data were loaded directly with the GEOquery library -> no need to download any files manually
Both the primary data and the metadata were loaded
Data was already standardized
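A minimal sketch of how such a load might look with GEOquery. The helper name `load_geo_series` and the accession shown in the usage comment are placeholders and are not taken from the project; the sketch assumes the Bioconductor packages GEOquery and Biobase are installed.

```r
# Sketch only: load a GEO series without manually downloading files.
# load_geo_series() is a hypothetical helper, not the project's own code.
load_geo_series <- function(accession) {
  # With GSEMatrix = TRUE, getGEO() returns a list of ExpressionSet objects
  gse_list <- GEOquery::getGEO(accession, GSEMatrix = TRUE)
  eset <- gse_list[[1]]
  list(
    expression = Biobase::exprs(eset),  # primary data: features x samples
    metadata   = Biobase::pData(eset)   # sample-level metadata
  )
}

# Usage (needs network access; the accession below is a placeholder):
# geo <- load_geo_series("GSEXXXXX")
# dim(geo$expression)
# head(geo$metadata)
```

Returning both pieces in one list keeps the expression values and the sample annotations together for later joins.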
| miRNA from the GEO data | Frequency of the different states of cancer |
Majority of data comes from patients with earlier stage cancer
The patients are usually elderly people
Fetch analyte.tsv & clinical.tsv from raw_/
The TCGAbiolinks library is used to retrieve data from the GDC Data Portal
Function: retrieve_and_prepare()
GDCquery: Query to specify the data to get
GDCdownload: Downloading the samples from the query
Example:
miRNA data - 2 separate dataframes
mRNA data - Large SummarizedExperiment
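The query, download and prepare steps could be sketched roughly as follows. This is not the project's `retrieve_and_prepare()`; the helper name `fetch_tcga_sketch`, the project ID in the usage comment, and the chosen `data.type` / `workflow.type` values are assumptions made for illustration. It assumes TCGAbiolinks is installed and the GDC portal is reachable.

```r
# Sketch only: hypothetical helper showing the GDCquery -> GDCdownload ->
# GDCprepare sequence for one TCGA project.
fetch_tcga_sketch <- function(project) {
  # miRNA: GDCprepare() returns a data frame for this data type
  mirna_query <- TCGAbiolinks::GDCquery(
    project       = project,
    data.category = "Transcriptome Profiling",
    data.type     = "miRNA Expression Quantification"
  )
  TCGAbiolinks::GDCdownload(mirna_query)
  mirna <- TCGAbiolinks::GDCprepare(mirna_query)

  # mRNA: GDCprepare() returns a SummarizedExperiment for this data type
  mrna_query <- TCGAbiolinks::GDCquery(
    project       = project,
    data.category = "Transcriptome Profiling",
    data.type     = "Gene Expression Quantification",
    workflow.type = "STAR - Counts"  # assumed workflow
  )
  TCGAbiolinks::GDCdownload(mrna_query)
  mrna <- TCGAbiolinks::GDCprepare(mrna_query)

  list(mirna = mirna, mrna = mrna)
}

# Usage (downloads data; the project ID is an assumption):
# tcga <- fetch_tcga_sketch("TCGA-COAD")
```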
| Distribution of Age by Gender | Frequency of AJCC Pathologic Stages by Status |
| Log2 transformed data, miRNA | Log2 transformed data, mRNA |
Calculation of the normalization factors for the data (_log) with calcNormFactors and imputation of NAs using means.
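A small sketch of mean imputation followed by normalization factors, on toy data. The slides do not say whether means are taken per feature or per sample, so per-feature (row) means are an assumption here, as are the helper name `impute_row_means` and the pseudocount of 1 in the log2 transform. The edgeR step only runs if edgeR is installed.

```r
# Sketch only: replace each NA with the mean of the observed values
# in the same row (feature). Row-wise means are an assumption.
impute_row_means <- function(mat) {
  row_means <- rowMeans(mat, na.rm = TRUE)
  na_idx <- which(is.na(mat), arr.ind = TRUE)  # (row, col) pairs of NAs
  mat[na_idx] <- row_means[na_idx[, "row"]]
  mat
}

# Toy count matrix with two missing values
counts <- matrix(c(10,  NA,  30,
                    5,   5,  NA,
                  100, 200, 300),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("feature", 1:3),
                                 paste0("sample", 1:3)))

imputed <- impute_row_means(counts)
log_counts <- log2(imputed + 1)  # pseudocount of 1 is an assumption

if (requireNamespace("edgeR", quietly = TRUE)) {
  dge <- edgeR::DGEList(counts = imputed)
  dge <- edgeR::calcNormFactors(dge)  # TMM is the default method
  print(dge$samples$norm.factors)
}
```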
Running a “universal” edgeR differential analysis function with a quasi-likelihood model.
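One way such a reusable function might look, following the standard edgeR quasi-likelihood workflow. The name `run_edger_qlf`, the two-group design, and the simulated counts with "normal"/"tumor" labels are assumptions for illustration, not the project's actual function or data. The demo only runs if edgeR is installed.

```r
# Sketch only: hypothetical wrapper around the edgeR quasi-likelihood pipeline
# for a two-group comparison.
run_edger_qlf <- function(counts, group) {
  group  <- factor(group)
  dge    <- edgeR::DGEList(counts = counts, group = group)
  dge    <- edgeR::calcNormFactors(dge)
  design <- stats::model.matrix(~ group)
  dge    <- edgeR::estimateDisp(dge, design)
  fit    <- edgeR::glmQLFit(dge, design)
  # coef = 2 tests the second factor level against the first
  test   <- edgeR::glmQLFTest(fit, coef = 2)
  edgeR::topTags(test, n = Inf)$table
}

if (requireNamespace("edgeR", quietly = TRUE)) {
  set.seed(1)
  # Simulated counts: 200 features, 3 samples per group
  sim_counts <- matrix(rnbinom(200 * 6, mu = 50, size = 5), nrow = 200,
                       dimnames = list(paste0("gene", 1:200),
                                       paste0("s", 1:6)))
  sim_group <- rep(c("normal", "tumor"), each = 3)
  res <- run_edger_qlf(sim_counts, sim_group)
  print(head(res))
}
```

Taking a count matrix and a group vector as the only inputs lets the same function be applied to each dataset in turn.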
| Statistics table | Final _aug dataset |
|---|---|
| TCGA mRNA | TCGA miRNA | GSE miRNA |
|---|---|---|
| TCGA mRNA | TCGA miRNA | GSE miRNA |
|---|---|---|
Although we were able to follow the article’s instructions, our results differ significantly from those reported. The differences might be caused by additional steps taken during data preprocessing, or by the sparse information provided by the authors. It would be wise to contact the authors and ask for more details about preprocessing and data retrieval.
Overall, our analysis was carried out accurately, and the results did not indicate any grave errors.
In addition, the data we used contain a different number of samples in each stage, and the staging differs between the TCGA and GSE datasets.
R for Bio Data Science